These materials have been taken from the Software Carpentry: R Novice Lesson. You can find the original materials here
This lesson will cover some basic functions that can be used to manipulate data in R. Again, we will be using the gapminder data set, which includes country information on GDP, population, etc.
There are five main functions we’ll be talking about today, each allowing us to manipulate data frames. These five functions are:
select() – Choose columns (variables or attributes) from our data framefilter() – Choose rows (samples or observations) from our data framemutate() – Create new columns, based on existing onesgroup_by() – Group rows based on a particular column/value within that columnsummarize() – Perform some function on the grouped dataleft_join() – Combine two tables based on a shared columnIf you haven’t already, make sure you have dplyr() and gapminder() installed and loaded with the following commands:
# Download the packages
# install.packages(c("dplyr", "gapminder"))
# Load the packages for use
library(dplyr)
library(gapminder)
Let’s take a quick look at our data frame to remind ourselves of its structure. We do this using the head() command, which will display the first 10 rows (given by n = 10) of our data frame.
head(gapminder, n = 10)
# A tibble: 10 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
The first function we’ll be using is select(). This function let’s us pick columns from our data frame, based on name (e.g. year) or by index (e.g. 3).
Let’s try using select() to pick out a few columns: “country”, “year”, “lifeExp”, and “pop”. We’ll be assigning these columns to a new data frame, gapminder_select. Then we’ll use head() to see if it worked.
# select() code here
gapminder_select <- select(gapminder, country, year, lifeExp, pop)
# Check the data frame
head(gapminder_select, n = 10)
# A tibble: 10 x 4
country year lifeExp pop
<fct> <int> <dbl> <int>
1 Afghanistan 1952 28.8 8425333
2 Afghanistan 1957 30.3 9240934
3 Afghanistan 1962 32.0 10267083
4 Afghanistan 1967 34.0 11537966
5 Afghanistan 1972 36.1 13079460
6 Afghanistan 1977 38.4 14880372
7 Afghanistan 1982 39.9 12881816
8 Afghanistan 1987 40.8 13867957
9 Afghanistan 1992 41.7 16317921
10 Afghanistan 1997 41.8 22227415
As you can see, our new data frame contains only a subset of the columns from the original data frame, based on the names we provided in the select() command.
Here we’ll also introduce another great feature of dplyr(): the pipe ( %>% ). This symbol sends or pipes an object (e.g. a data frame like gapminder) INTO a function (e.g. select()).
So, the above select() command can be rewritten as follows (NOTE: the “.” is a placeholder, which represents the object being piped). Again, we can check our result using head().
# select() using pipe syntax
gapminder_pipe <- gapminder %>% select(., country, year, lifeExp, pop)
head(gapminder_pipe, n = 10)
# A tibble: 10 x 4
country year lifeExp pop
<fct> <int> <dbl> <int>
1 Afghanistan 1952 28.8 8425333
2 Afghanistan 1957 30.3 9240934
3 Afghanistan 1962 32.0 10267083
4 Afghanistan 1967 34.0 11537966
5 Afghanistan 1972 36.1 13079460
6 Afghanistan 1977 38.4 14880372
7 Afghanistan 1982 39.9 12881816
8 Afghanistan 1987 40.8 13867957
9 Afghanistan 1992 41.7 16317921
10 Afghanistan 1997 41.8 22227415
We can actually simplify the above command further - dplyr’s functions such as select() are smart enough that you don’t actually need to include the “.” placeholder, as shown below.
# select() using pipe syntax w/out a placeholder
gapminder_pipe2 <- gapminder %>% select(country, year, lifeExp, pop)
head(gapminder_pipe2, n = 10)
# A tibble: 10 x 4
country year lifeExp pop
<fct> <int> <dbl> <int>
1 Afghanistan 1952 28.8 8425333
2 Afghanistan 1957 30.3 9240934
3 Afghanistan 1962 32.0 10267083
4 Afghanistan 1967 34.0 11537966
5 Afghanistan 1972 36.1 13079460
6 Afghanistan 1977 38.4 14880372
7 Afghanistan 1982 39.9 12881816
8 Afghanistan 1987 40.8 13867957
9 Afghanistan 1992 41.7 16317921
10 Afghanistan 1997 41.8 22227415
Using the select() command and pipe (%>%) notation, pick the following columns from the gapminder data frame, assign them to a new variable (we’ll use x), and display the results using head(x, n = 10). Columns to choose are:
# Answer here:
x <- select()
So we’ve covered selecting columns, but what about rows? This is where filter() comes in. This function allows us to choose rows from our data frame using some logical criteria. An example is filtering for rows in which the country is Canada. This can also be applied to numerical values, such as the year being equal to 1967, or life expectancy greater than 30.
NOTE: In R, equality (e.g. country is Canada, year is 1967) is done using a double equals sign (==).
Let’s go through a couple examples.
# Filter rows where country is Canada
gapminder_canada <- gapminder %>% filter(country == "Canada")
head(gapminder_canada, n = 10)
# A tibble: 10 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Canada Americas 1952 68.8 14785584 11367.
2 Canada Americas 1957 70.0 17010154 12490.
3 Canada Americas 1962 71.3 18985849 13462.
4 Canada Americas 1967 72.1 20819767 16077.
5 Canada Americas 1972 72.9 22284500 18971.
6 Canada Americas 1977 74.2 23796400 22091.
7 Canada Americas 1982 75.8 25201900 22899.
8 Canada Americas 1987 76.9 26549700 26627.
9 Canada Americas 1992 78.0 28523502 26343.
10 Canada Americas 1997 78.6 30305843 28955.
Let’s try another one, this time filtering on life expectancy above a certain threshold:
# Filter for rows where life expectancy is greater than 50
gapminder_LE <- gapminder %>% filter(lifeExp > 50)
head(gapminder_LE, n = 10)
# A tibble: 10 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Albania Europe 1952 55.2 1282697 1601.
2 Albania Europe 1957 59.3 1476505 1942.
3 Albania Europe 1962 64.8 1728137 2313.
4 Albania Europe 1967 66.2 1984060 2760.
5 Albania Europe 1972 67.7 2263554 3313.
6 Albania Europe 1977 68.9 2509048 3533.
7 Albania Europe 1982 70.4 2780097 3631.
8 Albania Europe 1987 72 3075321 3739.
9 Albania Europe 1992 71.6 3326498 2497.
10 Albania Europe 1997 73.0 3428038 3193.
We can also filter with multiple arguments, each separated by a comma:
# filter() for Canada and life expectancy greater than 80
gapminder_C_LE <- gapminder %>% filter(country == "Canada", lifeExp > 80)
head(gapminder_C_LE, n = 10)
# A tibble: 1 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Canada Americas 2007 80.7 33390141 36319.
Use filter() to choose data for African countries, from the year 1980 and onwards.
# Challenge 2 code here
x <- filter()
Let’s say we now want to calculate the GDP in billions, which is done by mutiplying the GDP per capita by the population, then dividing by 1 billion (1 * 10^9). mutate() will perform this calculation on each row in the data frame, one row at a time (i.e. row-wise). The code below will calculate the GDP in billions:
gdpPercap * pop / 10^9# Use mutate() to calculate GDP in billions
gapminder_gdpBil <- gapminder %>% mutate(gdp_billion = gdpPercap * pop / 10^9)
head(gapminder_gdpBil, n = 10)
# A tibble: 10 x 7
country continent year lifeExp pop gdpPercap gdp_billion
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6.57
2 Afghanistan Asia 1957 30.3 9240934 821. 7.59
3 Afghanistan Asia 1962 32.0 10267083 853. 8.76
4 Afghanistan Asia 1967 34.0 11537966 836. 9.65
5 Afghanistan Asia 1972 36.1 13079460 740. 9.68
6 Afghanistan Asia 1977 38.4 14880372 786. 11.7
7 Afghanistan Asia 1982 39.9 12881816 978. 12.6
8 Afghanistan Asia 1987 40.8 13867957 852. 11.8
9 Afghanistan Asia 1992 41.7 16317921 649. 10.6
10 Afghanistan Asia 1997 41.8 22227415 635. 14.1
We’ve seen that pipes ( %>% ) can be used to send an object such as a data frame into a function, such as select(), or filter(). But they can also be used to send the output of one function into another function. This allows us to chain together multiple commmands, without the need for intermediate variables.
Let’s take a look at this in an example.
# select() the five columns, and filter() for Canada
gapminder_multi <- gapminder %>%
select(country, year, lifeExp, pop, gdpPercap) %>%
filter(country == "Canada")
head(gapminder_multi, n = 10)
# A tibble: 10 x 5
country year lifeExp pop gdpPercap
<fct> <int> <dbl> <int> <dbl>
1 Canada 1952 68.8 14785584 11367.
2 Canada 1957 70.0 17010154 12490.
3 Canada 1962 71.3 18985849 13462.
4 Canada 1967 72.1 20819767 16077.
5 Canada 1972 72.9 22284500 18971.
6 Canada 1977 74.2 23796400 22091.
7 Canada 1982 75.8 25201900 22899.
8 Canada 1987 76.9 26549700 26627.
9 Canada 1992 78.0 28523502 26343.
10 Canada 1997 78.6 30305843 28955.
We can further expand on this by incorporating our mutate() command from earlier, linking multiple functions into a single command. Be sure to indent (TAB key) when moving to a new line after a pipe.
# select() the four columns, filter() for Canada, and calculate GDP in billions
gapminder_multi_2 <- gapminder %>%
select(country, year, lifeExp, pop, gdpPercap) %>%
filter(country == "Canada") %>%
mutate(gdp_billion = gdpPercap * pop / 10^9)
head(gapminder_multi_2, n = 10)
# A tibble: 10 x 6
country year lifeExp pop gdpPercap gdp_billion
<fct> <int> <dbl> <int> <dbl> <dbl>
1 Canada 1952 68.8 14785584 11367. 168.
2 Canada 1957 70.0 17010154 12490. 212.
3 Canada 1962 71.3 18985849 13462. 256.
4 Canada 1967 72.1 20819767 16077. 335.
5 Canada 1972 72.9 22284500 18971. 423.
6 Canada 1977 74.2 23796400 22091. 526.
7 Canada 1982 75.8 25201900 22899. 577.
8 Canada 1987 76.9 26549700 26627. 707.
9 Canada 1992 78.0 28523502 26343. 751.
10 Canada 1997 78.6 30305843 28955. 878.
These functions allow us to work on our data in specific groups. For example, we can use group_by() to group observations by country, then calculate the average life expectancy for each country.
# group_by() country, calculate average life expectancy
gapminder_grp <- gapminder %>%
group_by(country) %>%
summarise(mean(lifeExp))
head(gapminder_grp, n = 10)
# A tibble: 10 x 2
country `mean(lifeExp)`
<fct> <dbl>
1 Afghanistan 37.5
2 Albania 68.4
3 Algeria 59.0
4 Angola 37.9
5 Argentina 69.1
6 Australia 74.7
7 Austria 73.1
8 Bahrain 65.6
9 Bangladesh 49.8
10 Belgium 73.6
Let’s do another example, again grouping by country. This time, we’ll calculate the mean and standard deviation of the GDP per capita. We’ll also specify the column names inside of the summarise() command.
gapminder_mean_sd <- gapminder %>%
group_by(country) %>%
summarise(mean_gdp = mean(gdpPercap), sd_gdp = sd(gdpPercap))
head(gapminder_mean_sd, n = 10)
# A tibble: 10 x 3
country mean_gdp sd_gdp
<fct> <dbl> <dbl>
1 Afghanistan 803. 108.
2 Albania 3255. 1192.
3 Algeria 4426. 1310.
4 Angola 3607. 1166.
5 Argentina 8956. 1863.
6 Australia 19981. 7815.
7 Austria 20412. 9655.
8 Bahrain 18078. 5415.
9 Bangladesh 818. 235.
10 Belgium 19901. 8391.
Let’s say you have a data frame you’ve generated, containing a list of gene IDs and some corresponding value, such as expression. And you have another table that contains many (e.g. all) human gene IDs, as well as the names for those genes. For example the gene ID “ENSG00000012048” corresponds to the gene BRCA1. Now let’s say you want to map between the two tables using the gene ID, to create a single table with ID-Expression-Name for each gene. That’s what left_join() is for!
This function takes one data frame “x” and using a specified column, looks for matching entries in “y”. Note that the output data frame will contain all rows and columns from “x”, as well as all columns from “y”, but only matching rows from “y”.
First let’s load some simple example data to play with:
fruits1 <- read.csv("../data/fruits_table1.csv")
fruits2 <- read.csv("../data/fruits_table2.csv")
head(fruits1)
FruitID Fruit_Exp
1 f1 1
2 f2 -1
3 f3 0
4 f4 4
5 f5 -8
head(fruits2)
FruitID FruitName
1 f1 apple
2 f2 orange
3 f3 banana
4 f4 pear
5 f5 blueberry
6 f6 melon
Now we can use left_join() to combine the two tables, based on matching values in a specified column. The syntax is as follow:
left_join(fruits1, fruits2, by = "FruitID")
FruitID Fruit_Exp FruitName
1 f1 1 apple
2 f2 -1 orange
3 f3 0 banana
4 f4 4 pear
5 f5 -8 blueberry
Extra: Note that you can have different column names in each of your data frames, and still join the tables together. The syntax for this is:
left_join(x, y, by = c("columnX" = "columnY"))
Now let’s use all the commands we’ve covered and combine them with pipes into a single statement.
Let’s say we want calculate the mean and SD of the GDP (in billions) for each country, but only considering data from 1980 and onwards. We can accomplish this all in one step as follows.
# select() columns, filter() by year, calculate GDP in billions, mean() and sd() of GDP in billions
gapminder_final <- gapminder %>%
select(country, year, pop, gdpPercap) %>%
filter(year >= 1980) %>%
mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
group_by(country) %>%
summarise(mean_gdpBillion = mean(gdp_billion), sd_gdpBillion = sd(gdp_billion))
head(gapminder_final, n = 10)
# A tibble: 10 x 3
country mean_gdpBillion sd_gdpBillion
<fct> <dbl> <dbl>
1 Afghanistan 16.4 7.66
2 Albania 13.1 4.84
3 Algeria 149. 33.2
4 Angola 28.9 15.5
5 Argentina 353. 91.5
6 Australia 478. 154.
7 Austria 225. 50.3
8 Bahrain 12.4 5.15
9 Bangladesh 120. 54.3
10 Belgium 272. 54.6